
batches author lookups in imports#3126

Open
gabestein wants to merge 2 commits into main from gs/batch-author-import

Conversation


@gabestein gabestein commented Jul 11, 2024

Issue(s) Resolved

Resolves #2795

Test Plan

Screenshots (if applicable)

Optional

Notes/Context/Gotchas

Supporting Docs

@pubpubBot pubpubBot temporarily deployed to pubpub-pipel-gs-batch-a-znoshx July 11, 2024 20:27 Inactive
@pubpubBot pubpubBot temporarily deployed to pubpub-pipel-gs-batch-a-znoshx July 12, 2024 19:22 Inactive
@gabestein gabestein closed this Apr 15, 2026
@gabestein gabestein reopened this Apr 15, 2026
@isTravis isTravis force-pushed the gs/batch-author-import branch from 80face9 to 1d6ffa3 Compare April 15, 2026 17:40
@isTravis

Pushed a commit with an updated approach that adapts to the changes made to main since this PR was opened.

Problem

When importing a document with many authors, getAttributions in workers/tasks/import/metadata.ts fires N individual getSearchUsers database queries (one per author name) via Promise.all. Each query runs an ILIKE '%name%' scan on the User table. With many authors this causes DB connection contention and can exceed the 2-minute worker timeout.
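To make the failure mode concrete, here is a minimal sketch of the pre-fix pattern. The names (`fakeSearchUsers`, `lookupAuthors`) are illustrative stand-ins, not the actual PubPub code; the point is that `Promise.all` fires one query per author with no limit on concurrency:

```typescript
// Hypothetical sketch of the pre-fix pattern: one DB round trip per author.
// fakeSearchUsers stands in for the real getSearchUsers ILIKE '%name%' scan.
type User = { id: string; fullName: string };

let queryCount = 0;
const fakeSearchUsers = async (name: string): Promise<User[]> => {
  queryCount += 1; // each call would be a separate ILIKE scan on the User table
  return name === "Ada Lovelace" ? [{ id: "u1", fullName: "Ada Lovelace" }] : [];
};

// N authors => N queries, all fired at once via Promise.all
const lookupAuthors = async (names: string[]) =>
  Promise.all(
    names.map(async (name) => ({ name, users: await fakeSearchUsers(name) })),
  );
```

With 50 authors this issues 50 simultaneous queries, which is where the connection contention comes from.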

Original PR Approach (July 2024)

The original branch (gs/batch-author-import) had two commits:

  1. Throttled queries with asyncMap — Replaced Promise.all + .map(async ...) with asyncMap at concurrency: 2, so only 2 author lookups run at a time instead of all at once.
  2. Bumped the global worker timeout — Changed maxWorkerTimeSeconds from 120 → 300, affecting all worker tasks (export, import, archive).

This reduced DB pressure but still issued N individual queries — just slower. And the timeout change was a blunt instrument.
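For reference, a concurrency-limited map like the one the original branch used can be sketched as below. This is a minimal stand-in for the codebase's `asyncMap` utility (its exact signature is an assumption here); it caps in-flight lookups at `concurrency` but still performs one query per item:

```typescript
// Minimal sketch of a concurrency-limited map, standing in for the
// codebase's asyncMap utility (signature assumed, not verified).
const asyncMap = async <T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  opts: { concurrency: number },
): Promise<R[]> => {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is drained,
  // so at most `concurrency` calls to fn are in flight at once.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  const workerCount = Math.min(opts.concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
};
```

At `concurrency: 2`, 50 authors still means 50 sequential-ish queries, just two at a time, which is why this helped with contention but not with total latency.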

Updated Approach (April 2026)

Three targeted changes that batch at the SQL level rather than throttle at the application level:

1. New batchSearchUsers function — server/search/queries.ts

Added a new export that takes an array of author names and issues a single User.findAll query with all names combined into one OR clause. Results are mapped back per-name in JS using the same matching logic as the original per-name query (fullName contains, slug contains, email exact match). The existing getSearchUsers is left untouched for use by the search API.
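The per-name fan-out can be sketched as follows. The shapes and names here are assumptions based on the description above, not the actual `server/search/queries.ts` code; the Sequelize query itself is shown only as a comment since it needs a live DB:

```typescript
// Sketch of the batch lookup's JS-side fan-out (names and shapes assumed).
// The single findAll would combine all names into one OR clause, roughly:
//   User.findAll({ where: { [Op.or]: names.flatMap((n) => [
//     { fullName: { [Op.iLike]: `%${n}%` } },
//     { slug: { [Op.iLike]: `%${n}%` } },
//     { email: n },
//   ]) } })
// The pooled rows are then mapped back per name with the same criteria as
// the old per-name query: fullName contains, slug contains, email exact.
type User = { fullName: string; slug: string; email: string };

const matchUsersToName = (users: User[], name: string): User[] => {
  const needle = name.toLowerCase();
  return users.filter(
    (u) =>
      u.fullName.toLowerCase().includes(needle) ||
      u.slug.toLowerCase().includes(needle) ||
      u.email.toLowerCase() === needle,
  );
};

const fanOut = (users: User[], names: string[]): User[][] =>
  names.map((name) => matchUsersToName(users, name));
```

Because the matching criteria are a superset check run in JS, a row matched by the OR clause for one name never leaks into another name's results.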

2. Rewritten getAttributions — workers/tasks/import/metadata.ts

Instead of Promise.all over N async lookups, the function now:

  • Filters authorEntries to collect all string names upfront
  • Calls batchSearchUsers once with the full list
  • Synchronously maps results back into the attribution objects

No concurrency management needed since it's a single query.
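The three steps above can be sketched as one small function. The entry shape and the `batchSearchUsers` signature are assumptions for illustration, not the actual `metadata.ts` code:

```typescript
// Sketch of the rewritten flow (entry shape and signatures are assumptions).
// One batched lookup replaces N concurrent queries; mapping back is synchronous.
type AuthorEntry = string | { name: string };
type User = { id: string; fullName: string };

const getAttributions = async (
  authorEntries: AuthorEntry[],
  batchSearchUsers: (names: string[]) => Promise<Map<string, User[]>>,
) => {
  // 1. Collect all string names upfront
  const names = authorEntries.map((e) => (typeof e === "string" ? e : e.name));
  // 2. One round trip for the full list
  const matches = await batchSearchUsers(names);
  // 3. Synchronously map results back into attribution objects
  return names.map((name) => ({ name, user: matches.get(name)?.[0] ?? null }));
};
```

Since there is exactly one awaited call regardless of author count, there is nothing left for `asyncMap` or `Promise.all` to throttle.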

3. Per-task timeout — workers/queue.ts

Added import: 300 (5 minutes) to the existing customTimeouts map. This gives imports more headroom for large documents without affecting other task types. The codebase already had this mechanism in place for the archive task.
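A sketch of how that mechanism plausibly looks, assuming the names given in this PR (`customTimeouts`, `maxWorkerTimeSeconds`) and using an illustrative value for the pre-existing `archive` entry:

```typescript
// Sketch of the per-task timeout lookup (names assumed from the PR
// description; the archive value shown here is illustrative only).
const maxWorkerTimeSeconds = 120; // global default for worker tasks

const customTimeouts: Record<string, number> = {
  archive: 300, // pre-existing entry demonstrating the mechanism
  import: 300, // added by this PR: 5 minutes of headroom for large imports
};

const getTaskTimeout = (taskType: string): number =>
  customTimeouts[taskType] ?? maxWorkerTimeSeconds;
```

Unlisted task types (export, etc.) keep the 120-second default, which is what makes this narrower than the original branch's global bump.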

Key Insight

One SQL query with 50 names in its WHERE clause is far cheaper than 50 individual queries run 2 at a time. Batching at the database level eliminates the need for application-level concurrency control entirely.

@isTravis isTravis marked this pull request as ready for review April 15, 2026 17:50


Development

Successfully merging this pull request may close these issues.

Cannot import LaTeX with many authors
